Experimental genomics in combination with the growing body of sequence information promises to 
revolutionize the way cells and cellular processes are studied. Information on genomic sequence can be used 
experimentally with high-density DNA arrays that allow complex mixtures of RNA and DNA to be interrogated 
in a parallel and quantitative fashion. DNA arrays can be used for many different purposes, most prominently 
to measure levels of gene expression (messenger RNA abundance) for tens of thousands of genes 
simultaneously. Measurements of gene expression and other applications of arrays embody much of what is 
implied by the term 'genomics'; they are broad in scope, large in scale, and take advantage of all available 
sequence information for experimental design and data interpretation in pursuit of biological understanding. 



Biological and biomedical research is in the 
midst of a significant transition that is being 
driven by two primary factors: the massive 
increase in the amount of DNA sequence 
information and the development of 
technologies to exploit its use. Consequently, we find 
ourselves at a time when new types of experiments are 
possible, and observations, analyses and discoveries are 
being made on an unprecedented scale. Over the past few 
years, more than 30 organisms have had their genomes 
completely sequenced, with another 100 or so in progress 
(see www.tigr.org or genomes@ncbi.nlm.nih.gov for 
a list). At least partial sequence has been obtained for 
tens of thousands of mouse, rat and human genes, and 
the sequence of two entire human chromosomes 
(chromosomes 21 and 22) has been determined 1,2 . Within 
the year, a large proportion of the human genome will be 
deciphered, in both public and private efforts, and the 
complete sequence of the mouse and other animal and 
plant genomes will undoubtedly follow close behind. 
Unfortunately, the billions of bases of DNA sequence do 
not tell us what all the genes do, how cells work, how cells 
form organisms, what goes wrong in disease, how we age 
or how to develop a drug. This is where functional 
genomics comes into play. The purpose of genomics is to 
understand biology, not simply to identify the component 
parts, and the experimental and computational methods 
take advantage of as much sequence information as 
possible. In this sense, functional genomics is less a specific 
project or programme than it is a mindset and general 
approach to problems. The goal is not simply to provide a 
catalogue of all the genes and information about their 
functions, but to understand how the components work 
together to comprise functioning cells and organisms. 

To take full advantage of the large and rapidly increasing 
body of sequence information, new technologies are 
required. Among the most powerful and versatile tools for 
genomics are high-density arrays of oligonucleotides or com- 
plementary DNAs. Nucleic acid arrays work by hybridization 
of labelled RNA or DNA in solution to DNA molecules 
attached at specific locations on a surface. The hybridization 
of a sample to an array is, in effect, a highly parallel search by 
each molecule for a matching partner on an 'affinity matrix', 
with the eventual pairings of molecules on the surface 
determined by the rules of molecular recognition. Arrays of 
nucleic acids have been used for biological experiments for 
many years 3-8 . Traditionally, the arrays consisted of fragments 
of DNA, often with unknown sequence, spotted on a porous 
membrane (usually nylon). The arrayed DNA fragments 
often came from cDNA, genomic DNA or plasmid libraries, 
and the hybridized material was often labelled with a radioac- 
tive group. Recently, the use of glass as a substrate and fluores- 
cence for detection, together with the development of new 
technologies for synthesizing or depositing nucleic acids on 
glass slides at very high densities, have allowed the miniatur- 
ization of nucleic acid arrays with concomitant increases in 
experimental efficiency and information content 9-14 (Fig. 1). 

While making arrays with more than several hundred 
elements was until recently a significant technical 
achievement, arrays with more than 250,000 different 
oligonucleotide probes or 10,000 different cDNAs per 
square centimetre can now be produced in significant 
numbers 15,16 . Although it is possible to synthesize or deposit 
DNA fragments of unknown sequence, the most common 
implementation is to design arrays based on specific 
sequence information, a process sometimes referred to as 
'downloading the genome onto a chip' (Fig. 1). There are 
several variations on this basic technical theme: the 
hybridization reaction may be driven (for example, by an 
electric field) 17,18 ; other detection methods 19 besides fluorescence 
can be used; and the surface may be made of materials 
other than glass such as plastic, silicon, gold, a gel or 
membrane, or may even be comprised of beads at the ends of 
fibre-optic bundles 20-22 . Nonetheless, the key elements of 
parallel hybridization to localized, surface-bound nucleic 
acid probes and subsequent counting of bound molecules 
are ubiquitous, and high-density arrays of nucleic acids on 
glass (often called DNA microarrays, oligonucleotide 
arrays, GeneChip arrays, or simply 'chips') and their 
biological uses will be the focus of this review. 

Global gene expression experiments 

One of the most important applications for arrays so far is the 
monitoring of gene expression (mRNA abundance). The collection 
of genes that are expressed or transcribed from 
genomic DNA, sometimes referred to as the expression 
profile or the 'transcriptome', is a major determinant of cellular phenotype 
and function. The transcription of genomic DNA to produce 
mRNA is the first step in the process of protein synthesis, and 
differences in gene expression are both responsible for morphological 
and phenotypic differences and indicative of cellular responses to 
environmental stimuli and perturbations. Unlike the genome, the 
transcriptome is highly dynamic and changes rapidly and dramatically 
in response to perturbations or even during normal cellular events 
such as DNA replication and cell division 23,24 . In terms of understand- 
ing the function of genes, knowing when, where and to what extent a 
gene is expressed is central to understanding the activity and biological 
roles of its encoded protein. In addition, changes in the multi-gene 
patterns of expression can provide clues about regulatory mechanisms 
and broader cellular functions and biochemical pathways. In the 
context of human health and treatment, the knowledge gained from 
these types of measurements can help determine the causes and conse- 
quences of disease, how drugs and drug candidates work in cells 
and organisms, and what gene products might have therapeutic uses 
themselves or may be appropriate targets for therapeutic intervention. 

Past discussions of arrays have often centred on technical issues and 
specific performance characteristics 25 . Now that nucleic acid arrays have 
been constructed for many different organisms 14,26-29 and used successfully 
to measure transcript abundance in a host of different experiments, 
the focus of interest has thankfully shifted. Investigators are now 
more concerned with questions of experimental design, data 
analysis, the use of small amounts of mRNA from limited sources, the 
best ways to extract biological meaning from the results, pathway and 
cell-circuitry modelling, and medical uses of expression patterns. 

Array-based gene expression monitoring 

One way to think of measurements with arrays is that they are simply 
a more powerful substitute for conventional methods of evaluating 
mRNA abundance. For some early experiments, only a relatively 
small set of genes, which were thought to be important to a process, 
were included on the arrays 12,30 . However, such experiments did not 
capitalize on the arrays' potential: a key advantage of using arrays, 
especially those that contain probes for tens of thousands of different 
genes, is that it is not necessary to guess what the important genes or 
mechanisms are in advance. Instead of looking only under the 
proverbial lamppost, a broader, more complete and less biased view 
of the cellular response is obtained (Figs 2, 3). 

The breadth of array-based observations almost guarantees that 
surprising findings will be made. A recent study measured the 
transcriptional changes that occur as cells progress through the 
normal cell-division cycle in humans for approximately 40,000 genes 
(R. J. Cho et al., unpublished results). In addition to the induction of 
DNA replication genes and genes involved with cell-cycle control and 
chromosome segregation that would be expected at specific stages in 
the cell cycle, a large collection of genes involved with smooth muscle 
function, apoptosis and intercellular adhesion and cell motility were 
found to be upregulated during a specific phase. The expected results 
act effectively as internal controls that provide a certain amount of 
validation (and comfort), while new information is obtained by a 
systematic search of a larger part of 'gene space'. In addition, because 
arrays often contain probes for genes of unknown function (and 
often with only partial sequence information), any outcome for these 
could be considered, in some sense, both surprising and novel 
(although clearly requiring further characterization). 

Other gene expression methods 

Not surprisingly, there are other ways to measure mRNA abundance, 
gene expression and changes in gene expression. For measuring gene 
expression at the level of mRNA, northern blots, polymerase chain 
reaction after reverse transcription of RNA (RT-PCR), nuclease 
protection, cDNA sequencing, clone hybridization, differential 
display 31 , subtractive hybridization, cDNA fragment fingerprinting 32-35 
and serial analysis of gene expression (SAGE) 36 have all been put to good 
use to measure the expression levels of specific genes, characterize 
global expression profiles or to screen for significant differences in 
mRNA abundance. But if messenger RNA is only an intermediate on 
the way to production of the functional protein products, why measure 
mRNA at all? One reason is simply that protein-based approaches are 
generally more difficult, less sensitive and have a lower throughput than 
RNA-based ones. But more importantly, mRNA levels are immensely 
informative about cell state and the activity of genes, and for most 
genes, changes in mRNA abundance are related to changes in protein 
abundance. Because of its importance, however, many methods have 
been developed for monitoring protein levels either directly or 
indirectly (see review in this issue by Pandey and Mann, pages 
837-846). These include western blots, two-dimensional gels, methods 
based on protein or peptide chromatographic separation and mass 
spectrometry detection 37-40 , methods that use specific protein-fusion 
reporter constructs and colorimetric readouts 41-44 , and methods based 
on characterization of actively translated, polysomal mRNA 45-47 . 

The importance of the protein-based methods is that they measure 
the final expression product rather than an intermediate. In addition, 
some of them enable the detection of post-translational protein modifi- 
cations (for example, phosphorylation and glycosylation) and protein 
complexes, and in some cases, yield information about protein localization, 
none of which are obtained directly by measurements of mRNA. 
There is no question that protein- and RNA-based measurements are 
complementary, and that protein-based methods are important as they 
measure observables that are not readily detected in other ways. 

Human disease, gene expression and discovery 

Genomics and gene expression experiments are sometimes derided 
as 'fishing expeditions'. Our view is that there is nothing wrong with a 
fishing expedition 48 if what you are after is 'fish', such as new genes 
involved in a pathway, potential drug targets or expression markers 
that can be used in a predictive or diagnostic fashion. Because the 
arrays can be designed and made on the basis of only partial 
sequence information, it is possible to include genes in a survey that 
are completely uncharacterized. In many ways, the spirit of this 
approach is more akin to that of classical genetics in which muta- 
tions are made broadly and at random (not only in specific genes), 
and screens or selections are set up to discover mutants with an 
interesting phenotype, which then leads to further characterization 
of specific genes. 

Such broad discovery experiments are probably better described 
as 'question-driven' rather than hypothesis-driven in the conven- 
tional sense. But that is not to diminish their value for understanding 
basic biological processes and even for understanding and treating 
human disease. For example, by analysing multiple samples obtained 
from individuals with and without acute leukaemia or diffuse large 
B-cell lymphoma, gene expression (mRNA) markers were discov- 
ered that could be used in the classification of these cancers 49,50 . The 
importance of monitoring a large number of genes was well illustrated 
in these studies. Golub et al. 49 found that reliable predictions could 
not be made based on any single gene, but that predictions based on 
the expression levels of 50 genes (selected from the more than 6,000 
monitored on the arrays) were highly accurate. The results of both of 
these studies indicate that measurements with more individuals and 
more genes will be needed to identify robust expression markers that 
are predictive of clinical outcome. But even with the limited initial 
data it was possible to help clarify an unusual case (classic leukaemia 
presentation but atypical morphology) and to use this information 
to guide the patient's clinical care. 
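
The point that many genes together can predict what no single gene can may be illustrated with a small weighted-vote classifier. This is a minimal sketch under our own assumptions and is not necessarily the procedure used in the cited studies: the signal-to-noise weight, the midpoint threshold, the function names and the expression values are all illustrative.

```python
from statistics import mean, stdev

def signal_to_noise(values_a, values_b):
    """Separation of one gene between two classes, scaled by its spread.
    Assumes the gene varies within at least one class (non-zero spread)."""
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))

def train(samples_a, samples_b, n_genes):
    """Pick the n_genes most discriminating genes.

    samples_a, samples_b: lists of expression vectors, one per sample.
    Returns a list of (gene_index, weight, midpoint) tuples.
    """
    scored = []
    for g in range(len(samples_a[0])):
        a = [s[g] for s in samples_a]
        b = [s[g] for s in samples_b]
        scored.append((g, signal_to_noise(a, b), (mean(a) + mean(b)) / 2))
    # Rank genes by the magnitude of their class separation.
    scored.sort(key=lambda t: abs(t[1]), reverse=True)
    return scored[:n_genes]

def predict(model, sample):
    """Each selected gene votes for class A (positive) or B (negative)."""
    vote = sum(w * (sample[g] - mid) for g, w, mid in model)
    return "A" if vote > 0 else "B"

# Made-up expression values: 3 genes, 3 training samples per class.
class_a = [[5.0, 1.0, 3.0], [6.0, 1.5, 2.0], [5.5, 0.5, 4.0]]
class_b = [[1.0, 5.0, 3.5], [1.5, 6.0, 2.5], [0.5, 5.5, 3.0]]

model = train(class_a, class_b, n_genes=2)
print(predict(model, [5.2, 1.2, 3.0]))
print(predict(model, [0.8, 5.1, 3.1]))
```

In this toy data the third gene does not separate the classes and is dropped during training, while the two retained genes vote in opposite directions with equal weight.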

It is also possible to take a related approach to help understand 
what goes wrong in cancerous, transformed cells and to identify 
the genes responsible for disease. Causative effects and potential 
therapeutic targets can be identified by determining which genes are 
upregulated in different tumour types 51-55 , and specific candidate 
genes can be intentionally overexpressed in cell lines or cells treated 
with growth factors in order to identify downstream target genes 
and to explore signalling pathways 56-58 . Tumorigenesis is often 
accompanied by changes in chromosomal DNA, such as genetic 
rearrangements, amplifications or losses of particular chromosomal 
loci, and developmental abnormalities, such as Down's or Turner's 
syndrome, may arise from aberrations in DNA copy number. 
Because genomic DNA can be interrogated in much the same way as 
mRNA, comparisons of the copy number of genomic regions or the 
genotype of genetic markers can be used to detect chromosomal 
regions and genes that are amplified or deleted in cancerous or 
pre-cancerous cells. By using arrays containing probes for a large 
number of genes or polymorphic markers, changes in DNA copy 
number have been detected in both breast cancer cell lines and in 
tumours 59-61 . The identification of when and where changes in copy 
number or chromosomal rearrangements have occurred can be used 
in both the classification of cancer types and the identification of 
regions that may harbour tumour-suppressor genes. 

Whole-genome hypotheses 

The use of genomics tools such as arrays does not, of course, preclude 
hypothesis-driven research. For fully sequenced organisms, arrays 
containing probes for every annotated gene in the genome have been 
produced 14,26 . With these one can ask, for example, whether a 
transcription factor has a global role in transcription (affecting all 
genes) or a specific role (affecting only some). Holstege et al. 62 used 
this type of application in a genome-wide expression analysis in yeast 
to functionally dissect the machinery of transcription initiation. 
Similarly, genes located near the ends of chromosomes in yeast (as 
well as genes at the mating-type locus) are known to be transcriptionally 
'silent'. Full genome arrays allow the chromosomal landscape of 
silencing to be mapped, and make it possible to test whether what is 
true for a handful of well-studied genes near the telomeres is true for 
all telomeric genes, and whether any centromere-proximal genes are 
also transcriptionally silenced 63 . 

It is important to emphasize that these new, parallel approaches 
do not replace conventional methods. Standard methods such as 
northern blots, western blots or RT-PCR are simply used in a more 
targeted fashion to complement the broader measurements and to 
follow up on the genes, pathways and mechanisms implicated by the 
array results. Because the incidence of false-positive results can be 
made sufficiently low (see Fig. 2), it is not necessary to independently 
confirm every change for the results to be valid and trustworthy, 
especially if conclusions are based on changes in sets of genes rather 
than individual genes. More detailed follow-up is recommended if a 
gene is being chosen, for example, as a drug target, as a candidate for 
population genetics studies, or as the target for the construction of a 
knockout mouse. 

Does gene expression indicate function? 

As additional, uncharacterized open reading frames (ORFs) are 
identified in different organisms by the various genome sequencing 
projects, researchers have begun to ask whether the expression pat- 
tern for a gene can be used to predict the functional role of its protein 
product. An increasingly common approach involves using the gene 
expression behaviour observed over multiple experiments to first 
cluster genes together into groups (see Fig. 3), either by manual 
examination of the data 24 , or by using statistical methods such as self-organizing 
maps 64 , K-tuple means clustering or hierarchical clustering 23,65,66 . 
The basic assumption underlying this approach is that 
genes with similar expression behaviour (for example, increasing 
and decreasing together under similar circumstances) are likely to be 
related functionally. In this way, genes without previous functional 
assignments can be given tentative assignments or assigned a role in a 
biological process based on the known functions of genes in the same 
expression cluster (that is, the concept of 'guilt-by-association'). The 
validity of this approach has been demonstrated for many genes in 
Saccharomyces cerevisiae, a simple organism for which the entire 
genomic sequence and the functional roles of approximately 60% of 
the genes are known 24,65,67 (Fig. 4). Although not logically rigorous, 
the utility of the guilt-by-association approach has been demonstrat- 
ed, as genes already known to be related do, in fact, tend to cluster 
together based on their experimentally determined expression pat- 
terns (Fig. 4). The approach is made more systematic and statistically 
sound by calculating the probability that the observed functional 
distribution of differentially expressed genes could have happened by 
chance. The application of statistical rigour is essential to avoid 
overly subjective interpretations of the results based on the predispo- 
sitions, prior knowledge and interests of the individual researcher. 
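
One common way to calculate such a probability is the hypergeometric tail: the chance that a cluster drawn at random from the genome would contain at least as many members of a functional category as were observed. The following is a minimal sketch, assuming genes are sampled without replacement; the function name and the example numbers (a 6,200-gene genome, a 200-gene category, a 100-gene cluster containing 20 category members) are hypothetical.

```python
from math import comb

def enrichment_p_value(genome_size, category_size, cluster_size, overlap):
    """Probability of seeing at least `overlap` category members in a
    cluster of `cluster_size` genes drawn at random, without replacement,
    from `genome_size` genes of which `category_size` are in the category."""
    max_overlap = min(category_size, cluster_size)
    total = comb(genome_size, cluster_size)
    # Sum the hypergeometric probabilities for overlap, overlap + 1, ...
    tail = sum(
        comb(category_size, k) * comb(genome_size - category_size, cluster_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total

# About 3 category members would be expected by chance in this cluster,
# so an overlap of 20 yields a very small probability.
p = enrichment_p_value(genome_size=6200, category_size=200,
                       cluster_size=100, overlap=20)
print(f"p = {p:.3g}")
```

When many categories and clusters are tested at once, the resulting probabilities would also need to be corrected for multiple comparisons.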

A tentative functional assignment may not be much more than a 
low-resolution description or general classification. Descriptions of 
this type are similar to those that come out of more classical genetic 
screens and selections, which have provided the vast majority of 
functional annotations to date — they indicate that genes are 
involved with a particular cellular phenotype and that they are likely 
to be involved with a certain set of other genes and processes. This 
allows researchers to focus attention on a smaller subset of genes, 
many of which may not have been obvious candidates in the absence 
of the global expression observations. This overall approach high- 
lights the importance of functional annotation and careful curation 
of existing sequence, function and knowledge databases (see below). 
Expression results covering thousands or even tens of thousands 
of genes and expressed sequence tags (ESTs) will be only partly 
interpretable given the functional and biological information 
available at the time they are initially generated. Our ability to extract 
knowledge from measurements of global gene expression tends to 
increase with time as additional information becomes available, and 
results can be subjected to further interrogation in the light of new 
information, observations, questions and hypotheses. 

Gene expression and the regulation of transcription 

When information on the complete genome sequence is available, as 
is the case for increasing numbers of small and even larger genomes, 
gene expression data can be used to identify new cis-regulatory 
elements (genomic sequence motifs that are over-represented in the 
genomic DNA in the vicinity of similarly behaving genes) and 
'regulons' (sets of co-regulated genes), the basic units of the underlying 
cellular circuitry (Fig. 3d). In fact, the correlation between the 
presence of specific sequence motifs in promoter regions and gene 
expression patterns may be stronger than the correlation between 
functional categories and gene expression patterns. In yeast studies, 
more than 50% of the genes that are transcribed in a cell cycle-specific 
manner and whose transcript abundance peaks in the G1 
phase of the cell cycle have an MCB (MluI cell-cycle box) within 500 
base pairs (bp) of their translational start site 24,68,69 . Similar observations 
have been made for yeast genes whose transcription is induced 
during sporulation 67 . In addition, new cis-regulatory elements may 
be revealed by examining classes of co-regulated genes (Fig. 3d). With 
sufficiently large numbers of experimental observations of expression 
behaviour, the boundaries and all functioning sequence variants 
of cis-regulatory elements might be predicted without the need for 
the more conventional approach using site-directed mutagenesis 
('promoter bashing'). The expression-based method will be especial- 
ly valuable in exotic organisms, such as Plasmodium falciparum, the 
causative agent for malaria, for which experimental identification or 
verification of transcription factor binding sites is difficult. 
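
The kind of comparison described above, in which the frequency of a motif upstream of co-regulated genes is set against its frequency in other genes, can be sketched as follows. This is an illustration only: the hexamer ACGCGT (the MluI recognition site) is assumed here as the MCB core, and the gene names and sequences are invented. Because this hexamer is its own reverse complement, scanning one strand is sufficient.

```python
MCB_CORE = "ACGCGT"  # assumed core motif; identical on both strands

def fraction_with_motif(upstream_seqs, motif=MCB_CORE, window=500):
    """Fraction of genes whose upstream sequence contains `motif` within
    `window` bp of the start site. Each sequence is assumed to end at the
    start site, so only its last `window` characters are searched."""
    if not upstream_seqs:
        return 0.0
    hits = sum(1 for seq in upstream_seqs.values()
               if motif in seq.upper()[-window:])
    return hits / len(upstream_seqs)

# Invented upstream regions for a co-regulated cluster and a background set.
cluster = {
    "gene_1": "TTGA" + "ACGCGT" + "AATTCCGG",
    "gene_2": "ggcc" + "acgcgt" + "ttaa",      # lower case is accepted
    "gene_3": "ACGCGT" + "A" * 600,            # motif lies beyond 500 bp
    "gene_4": "CCGGAACGCGTTT",
}
background = {
    "gene_5": "TTTTAAAACCCCGGGG",
    "gene_6": "GATTACAGATTACA",
    "gene_7": "AAACGCGTAAA",
    "gene_8": "CCCCCCCCCC",
}

print(fraction_with_motif(cluster), fraction_with_motif(background))
```

A motif that is markedly more frequent in the cluster than in the background is a candidate regulatory element, although its significance would still need to be assessed statistically.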

Gene expression profiles as 'fingerprints' 

An often overlooked aspect of measurements of global gene expres- 
sion is that the sequence or even the origin of the arrayed probes does 
not need to be known to make interesting observations — the 
complex profiles, consisting of thousands of individual observations, 
can serve as transcriptional 'fingerprints'. The fingerprints can be 
used for classification purposes or as tests for relatedness, in a similar 
manner to the way in which DNA fingerprints are used in paternity 
testing. In one example, transcriptional fingerprints have been used 
to determine the target of a drug 70 . The basic idea is that if a drug 
interacts with and inactivates a specific cellular protein, the pheno- 
type of the drug-treated cell should be very similar to the phenotype 
of a cell in which the gene encoding the protein has been genetically 
inactivated, usually through mutation. Thus, by comparing the 
expression profile of a drug-treated cell to the profiles of cells in 
which single genes have been individually inactivated, specific 
mutants can be matched to specific drugs, and therefore, targets to 
drugs. In a demonstration of this concept, the gene product of the 
his3 gene was identified correctly as the target of 3-aminotriazole 70 . 
Similarly, profiles have been used in the classification of cancers and 
the classification schemes did not depend on any specific informa- 
tion about the genes involved 49,50 , although that information can be 
used to draw further biological and mechanistic conclusions. Finally, 
expression profiles can be used to classify drugs and their mode of 
action. For example, the functional similarity and specificity of 
different purine analogues have been determined by comparing the 
genome-wide effects on treated yeast, murine and human cells 71,72 . 
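
Matching a drug-treated profile against a collection of mutant profiles amounts to a nearest-neighbour search over expression fingerprints. The sketch below uses the Pearson correlation as the similarity measure, which is an assumption on our part rather than the measure used in the cited work; the mutant labels and log-ratio values are invented.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def best_matching_mutant(drug_profile, mutant_profiles):
    """Name of the mutant whose profile correlates best with the drug profile."""
    return max(mutant_profiles,
               key=lambda name: pearson(drug_profile, mutant_profiles[name]))

# Invented log expression ratios (treated or mutant versus wild type), 5 genes.
drug = [2.0, -1.0, 0.1, 1.5, -0.2]
mutants = {
    "mutant_A": [1.8, -0.9, 0.0, 1.4, -0.1],
    "mutant_B": [-0.5, 0.3, 1.2, -0.1, 0.8],
    "mutant_C": [0.1, 0.0, -0.2, 0.1, 2.0],
}

print(best_matching_mutant(drug, mutants))
```

Note that nothing in the comparison requires the identity of the individual genes to be known, only that the same probes are measured in each experiment.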

Expression measurements from small amounts of RNA 

An important frontier in the development of gene expression 
technology involves reduction of the required amount of starting 
material. Most array-based expression measurements are done using 
RNA from a million or more cells, and obtaining such a relatively 
large sample is not a problem in many types of studies (for example, 
litres of yeast cells can be grown easily). However, in some cases, it is 
important or even necessary to use fewer cells, as when using a small 
organ from a fly or worm, sorted cells that express a rare marker, or 
laser-capture microdissected 73-75 tumour tissue. Efficient and 
reproducible mRNA amplification methods are required, and there 
are two primary approaches that show significant promise. The first 
is a PCR-based approach that has been used to make single-cell cDNA 
libraries 76-78 . We have found that the amplification is efficient and 
reproducible, but that the relative abundance of the cDNA products 
is not well correlated with the original mRNA levels (D. Giang and 
D. J. Lockhart, unpublished results), although normalization 
and referencing strategies can be used (D. de Graaf and E. Lander, 
personal communication). 

The second approach avoids PCR altogether and uses multiple 
rounds of linear amplification based on cDNA synthesis and a 
template-directed in vitro transcription (IVT) reaction 79-81 . This 
method has been used to characterize mRNA from single live 
neurons 81 and even subcellular regions, and more recently to amplify 
mRNA from 500 to 1,000 cells from microdissected brain tissues for 
hybridization to spotted cDNA arrays 82 . We have found that the 
multiple-round cDNA/IVT amplification method produces suffi- 
cient quantities of labelled material starting with as little as 1-50 ng 
total RNA, is highly reproducible (correlation coefficients greater 
than 0.97), and introduces much less quantitative bias than 
PCR-based amplification (D. Giang and D. J. Lockhart, unpublished 
results). These amplification methods make it possible to 
monitor large numbers of genes starting with very limited 
amounts of RNA and very few cells. The combination of arrays 
and powerful amplification strategies promises to be especially 
important for studies that use human biopsy material from 
inhomogeneous tissue, and in the areas of developmental biology, 
immunology and neurobiology. 

Genome analysis using arrays 

Although nucleic acid arrays are often equated with gene expression 
analysis, they may be used to collect much of the data that are 
obtained presently by Southern or northern blot hybridization tech- 
niques, but in a more highly parallel fashion (Figs 5, 6). Their utility 
in polymorphism detection and genotyping is described elsewhere 
(see review in this issue by Roses, pages 857-865), but there are many 
additional uses for these versatile tools. For example, genomic DNA 
samples can be manipulated experimentally to select for particular 
regions before hybridization to obtain specific types of information. 
In yeast, the location of hundreds of chromosomal origins of replica- 
tion can be determined in parallel by enriching for early-replicating 
regions using a variation of the Meselson-Stahl procedure and then 
hybridizing the resulting DNA to full genome arrays (E. A. Winzeler 
et al., unpublished results). Similarly, as probes for more intergenic 
regions are synthesized on arrays, it becomes possible to identify 
protein-binding sites: fragmented chromatin can be crosslinked to a 
protein and then immunoprecipitated with an antibody to that pro- 
tein. The DNA fraction of the immunoprecipitate can be labelled and 
hybridized to identify the approximate location of the binding site. In 
addition, full genome arrays can be used in the analysis of plasmid 
libraries in genetic selections such as two-hybrid screens 83 or, in 
principle, for any other type of experiment in which the information 
is contained in the form of RNA or DNA. Arrays also have 
applications in biophysical chemistry and biochemistry. For 
example, single-stranded DNA arrays were converted enzymatically 
into arrays of double-stranded DNA to characterize the interactions 
of proteins, and potentially other types of molecules, with double- 
stranded DNA 84 . 

Gene expression and cell circuitry 

Is it reasonable to consider the cell as a complex analogue circuit, and 
to attempt to reverse-engineer the cell circuitry much like an electri- 
cal engineer would do by measuring currents and voltages at a variety 
of nodes and under a variety of input conditions? In the case of 
the cell, expression levels and expression changes might take the place 
of electrical measurements, and could be measured under many 
experimental conditions. Is it possible that a genetic or cellular circuit 
of reasonable complexity could be adequately decoded or modelled, 
and if so, how many and what types of measurements and perturba- 
tions (or 'inputs') would be required so that the problem was not 
hopelessly underdetermined 85-89 ? Reasonably detailed circuit 
diagrams can be drawn and simulations of simple genetic circuits 
have been performed for systems of low complexity (for example, the 
lytic cycle of phage lambda, and simple control networks in 
Escherichia coli bacteria 90 ). But the situation is considerably more 
complex in the case of a eukaryotic cell. Using yeast as an example, if 
we assume that the expression level for each gene can be one of only 
four levels (off, low, medium or high), then if the 6,200 yeast genes 
behave independently, there are $4^{6200}$, or roughly $10^{3733}$, possible 
expression states. Of course, the expression levels of different genes 
are not all independent of one another, and there are some states that 
are physically unrealistic (for example, all genes 'off' or all genes 
'high'), but the number of possible cellular configurations is very 
large. In addition, coupling between circuit components, the effects 
of nonlinear feedback, redundancy and even noise and stochastic 
events make simulating a circuit of this complexity a rather daunting 
task, and not all relationships and cellular events are reflected at the 
level of mRNA abundance. 
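
The size of this state space can be checked directly. For independent genes that each take one of a fixed number of levels, the number of joint states is the number of levels raised to the power of the number of genes. The sketch below, with names of our own choosing, works with the base-10 logarithm of that count; for contrast it also evaluates the number of genes raised to the number of levels, a far smaller quantity that is easily confused with it.

```python
from math import log10

def log10_state_count(n_genes, n_levels):
    """Base-10 logarithm of n_levels ** n_genes, the number of joint
    expression states when every gene independently takes one of
    n_levels values."""
    return n_genes * log10(n_levels)

YEAST_GENES = 6200   # gene count used in the text
LEVELS = 4           # off, low, medium, high

exponent = log10_state_count(YEAST_GENES, LEVELS)
genes_to_the_levels = YEAST_GENES ** LEVELS  # 6,200^4, shown for contrast

print(f"levels^genes is about 10^{exponent:.0f}")
print(f"genes^levels = {genes_to_the_levels:.2e}")
```

Even a two-level (on or off) description of 6,200 independent genes would give $2^{6200}$ states, so the conclusion that the configuration space is very large does not depend on the exact number of levels assumed.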

Least clear may be what types of perturbations or inputs are likely 
to be the most informative in terms of defining the relationships 
between genes and pathways, and what might be a minimal set of 
'orthogonal perturbations' (treatments, genetic manipulations or 
growth conditions that have minimal overlap in their direct cellular 
effects). Certainly it is possible to delete every yeast gene one at a time 
(or even several at a time) and measure the expression profile for each 
mutant strain under a set of different growth conditions 70,91 . It is 
also possible to grow yeast on a matrix of thousands of different 
conditions and measure the resulting expression profiles for a range 
of mutated strains. It is clear that extensive experiments of this type, 
combined with information from other measurements such as 
yeast two-hybrid protein-protein interaction screens 92, and
measurements of protein levels, modification states and cellular
localization, will lead to useful groupings of genes in terms of function
and regulation (that is, a genetic, molecular and functional taxonomy),
and will supply some reasonably detailed information about the
relationships between certain genes and pathways. In addition, sets
of perturbations directed towards specific functions and 
cellular processes will allow higher-resolution and even mechanistic 
information for significant parts of the overall circuitry *2,93. However,
given the tremendous complexity of the system, it is unlikely that a 
complete and detailed cellular circuit diagram will result for even 
single-celled eukaryotes such as yeast any time in the near future. Even so,
the construction of first-order global models
and semi-quantitative circuit diagrams is extremely useful. Such
models serve to organize current information, relationships and 
hypotheses, and can be tremendously helpful for testing new 
hypotheses, interpreting new observations, designing new experiments
and predicting the likely effects of particular chemical, genetic
or cellular perturbations. They also serve as a scaffold upon which to 
build higher-resolution, more quantitative and complete models. 
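
The notion of 'orthogonal perturbations' (treatments with minimal overlap in their direct cellular effects) can be made concrete with a small selection routine. This is only a sketch under stated assumptions: each perturbation is represented by a set of directly affected genes, overlap is measured with the Jaccard index, and a greedy pass keeps a perturbation only if it overlaps every previously kept one by no more than a threshold. The perturbation names, gene labels, threshold and function names are all hypothetical, and the greedy rule is one simple choice rather than a method taken from the text.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap between two gene sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_orthogonal(targets: dict, max_overlap: float = 0.2) -> list:
    """Greedily keep perturbations whose direct-target sets overlap every
    already-kept set by at most max_overlap. Larger sets are considered first."""
    kept = []
    for name in sorted(targets, key=lambda n: (-len(targets[n]), n)):
        if all(jaccard(targets[name], targets[k]) <= max_overlap for k in kept):
            kept.append(name)
    return kept


# Hypothetical perturbations and made-up gene labels, for illustration only.
targets = {
    "heat_shock": {"g1", "g2", "g3", "g4"},
    "gene_A_deletion": {"g3", "g4", "g5"},
    "drug_X": {"g6", "g7"},
    "low_glucose": {"g1", "g2", "g3"},
}
print(pick_orthogonal(targets))
```

In practice the direct targets of a perturbation are rarely known in advance, which is part of why the choice of a minimal informative set remains unclear.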

Can we have too much data? 

Contrary to what is sometimes thought, the biggest problem for
making sense of the extensive results from genomics experiments is
not that there is too much data or that there are insufficiently sophisticated
algorithms and software tools for querying and visualizing
data on this scale. Larger problems of data management and analysis
have been solved by airlines, financial institutions, global retailers,
high-energy and plasma physicists, the military and global weather
predictors, among others. It is often beneficial to have a large number
of measurements and sometimes more data make it possible to
analyse results that might otherwise have been too 'messy', and to
detect patterns and relationships that would not have been obvious
or have sufficient statistical significance with smaller data sets. In
many types of studies, it is not possible to control completely all
variables, and the individual differences between common sample
types may be significant because of experimental difficulties (for
example, tissue inhomogeneity or variations in sample procedures)
or individual genetic variation (for example, different patients or
different tumours). But such factors do not preclude the discovery of
some genes that clearly 'cluster' or differentiate between the sample
sets. For example, meaningful results can be extracted from the
analysis of human tissue collected at different hospitals, by different
surgeons and at different times. An essential requirement in these
types of studies is that a sufficient number of experiments be
performed across multiple individuals and multiple tissue or tumour
samples to account for individual variation and possible tissue
inhomogeneity. Furthermore, confidence in the results is increased
as conclusions are based on sets of genes that show a consistent
response and that are consistently different between two or more sets
of results 49,50,52,53,95.

Making sense of genomic results

Although the difficulties of sample collection, data collection and
experimental design should not be underestimated, one of the most
challenging aspects of gene expression analysis is making sense of the
vast quantities of data and extracting conclusions and hypotheses that
are biologically meaningful. From experiments on global gene expression,
we may obtain data for thousands of genes, often forcing us to
consider processes, functions and mechanisms about which we know
very little. Thus, there is a need for more sophisticated systems of
knowledge representation (or 'knowledge bases') that organize the
data, facts, observations, relationships and even hypotheses that form
the basis of our current scientific understanding. This information
needs to be more than just stored; it needs to be available in a
way that helps scientists understand and interpret the often
complex observations that are becoming increasingly easy to make.
Unfortunately, the fact is that the scientific literature has been
somewhat haphazardly built, without the benefit of a controlled or
restricted vocabulary and well-defined semantics and grammar. To
take full advantage of the abilities of the new technologies and the
rapidly increasing amount of sequence information, it is absolutely
essential to incorporate the facts, ideas, connections, observations 
and so forth, which exist in the scientific literature and in the 
minds of scientists, into a form that is systematic, organized, 
linked, visualized and searchable. This clearly requires a great 
deal of dedicated, systematic human effort, but progress has 
been made. Databases such as the Saccharomyces Genome 
Database (SGD: genome-www.stanford.edu/Saccharomyces), the 
Munich Information Center for Protein Sequences (MIPS: 
www.mips.biochem.mpg.de), WormBase (www.wormbase.org), the
Kyoto Encyclopedia of Genes and Genomes (KEGG: 
www.genome.ad.jp/kegg), the Encyclopedia of E. coli Genes and
Metabolism (EcoCyc: http://ecocycpanbio.com/ecocyc) and FlyBase 
(flybase.bio.Indiana.edu/) incorporate sequence, genetics, gene 
expression, homology, regulation, function and phenotype information
in an organized and useable form 96-102. But a step beyond databases
of this type are ones in which concepts as well as facts are more fully 
integrated and related, allowing connections to be made between 
initially disparate observations and information, and across 
organisms. It is conceivable that the next step will evolve to the level of a 
biological 'expert system', not unlike the expert system ('Deep Blue') that
IBM scientists and engineers built to play chess (successfully) against 
the world's best chess player. Despite the potential for advancement 
on this front, it seems unlikely that computational tools will ever 
replace the trained human brain when it comes to making biological 
sense of new results. However, the appropriate tools are needed to bring 
information and relationships to scientists' fingertips so that the
most insightful questions can be asked and the most meaningful 
interpretations made. 
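
The role of a controlled vocabulary in linking observations across organisms can be illustrated with a toy data structure. This is a minimal sketch, not a description of how any of the databases named above are built: the class, its methods, the two vocabulary terms and the gene names are all hypothetical. The point it shows is that once annotations are restricted to an agreed set of terms, a single query returns related genes from different organisms.

```python
from collections import defaultdict


class KnowledgeBase:
    """Toy store mapping controlled-vocabulary terms to (organism, gene) pairs."""

    def __init__(self, vocabulary):
        self.vocabulary = frozenset(vocabulary)
        self._genes_by_term = defaultdict(set)

    def annotate(self, organism: str, gene: str, term: str) -> None:
        # Rejecting free-text terms is what keeps the vocabulary controlled.
        if term not in self.vocabulary:
            raise ValueError(f"unknown term: {term!r}")
        self._genes_by_term[term].add((organism, gene))

    def genes_for(self, term: str) -> list:
        """All annotated genes for a term, across organisms, in sorted order."""
        return sorted(self._genes_by_term.get(term, ()))


kb = KnowledgeBase({"DNA repair", "cell cycle"})
kb.annotate("yeast", "geneA", "DNA repair")  # hypothetical gene names
kb.annotate("worm", "geneB", "DNA repair")
print(kb.genes_for("DNA repair"))
```

A free-text label such as "repairs DNA" would be rejected by `annotate`, which is the kind of discipline the haphazardly built literature lacks.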

Conclusion 

For these array-based methods to become truly revolutionary, they 
must become an integral part of the daily activities of the typical 
molecular biology laboratory. Despite their impressive and rapidly 
growing résumé, these technologies are still in their infancy, with
plenty of room for technical improvements, further development, 
and more widespread acceptance and accessibility. We expect that the 
pattern of development and use of arrays and other parallel genomic 
methodologies will be similar to that seen for computers and other 
high-tech electronic devices, which started out as exotic and expensive
tools in the hands of the few developers and early adopters, and
then moved quickly to become easier to use, more available, less 
expensive and more powerful, both individually and because of their 
ubiquity. In fact, nucleic acid array-based methods that previously 
seemed exotic, and too expensive, are becoming routine as indicated 
by the huge increase in the number of publications that incorporate 
data obtained in this way. Despite the relative youth of these 
approaches, the achievement of technical goals that would have 
seemed like science fiction only a few years ago is now clearly in view. 
For example, we expect that measuring the expression level of essentially
every gene (including variant splice forms) on an array or two
starting with RNA from a small number of cells, or even a single cell, 
will soon be possible owing to advances in single-cell handling and 
RNA amplification methods, the output of large-scale sequencing 
efforts and achievable advances in array technology. In the future, 
arrays of peptides, proteins, small molecules, mRNAs, clones, tissues, 
cells and even multicellular organisms such as the nematode worm 
Caenorhabditis elegans may also become common. The combined 
use of all of these highly parallel methods, along with sequence 
information, computational tools, integrated knowledge databases, 
and the traditional approaches of biology, biochemistry, chemistry, 
physics, mathematics and genetics, increases the hopes of 
understanding the function and regulation of all genes and proteins, 
deciphering the underlying workings of the cell, determining the 
mechanisms of disease, and discovering ways to intervene with or 
prevent aberrant cellular processes in order to improve human health 
and well-being.